Section 9 - Get group-wise insights: group_by

Unite and conquer using group_by

group_by() lets you define groups within your data set. Its influence becomes clear when calling summarize() on a grouped dataset: summarizing statistics are calculated for the different groups separately.

In this exercise, you are going to create an ordered per-carrier summary of hflights by combining group_by(), summarize() and arrange().

library(hflights)

hflights_df <- hflights[sample(nrow(hflights), 720), ] 
hflights <- as_tibble(hflights)

# Rename Carrier with long names
lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental", 
         "DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways", 
         "WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier", 
         "FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")

hflights$UniqueCarrier<- lut[hflights$UniqueCarrier]
 
# Make an ordered per-carrier summary of hflights
hflights %>%
  group_by(UniqueCarrier) %>%
  summarize(
    p_canc = 100 * mean(Cancelled == 1),
    avg_delay = mean(ArrDelay, na.rm = TRUE)
  ) %>%
  arrange(avg_delay, p_canc)
## # A tibble: 15 x 3
##    UniqueCarrier      p_canc avg_delay
##    <chr>               <dbl>     <dbl>
##  1 US_Airways          1.13     -0.631
##  2 American            1.85      0.892
##  3 AirTran             0.982     1.85 
##  4 Alaska              0         3.19 
##  5 Mesa                1.27      4.01 
##  6 Delta               1.59      6.08 
##  7 Continental         0.678     6.10 
##  8 American_Eagle      2.90      7.15 
##  9 Atlantic_Southeast  3.45      7.26 
## 10 Southwest           1.55      7.59 
## 11 Frontier            0.716     7.67 
## 12 ExpressJet          1.55      8.19 
## 13 SkyWest             1.39      8.69 
## 14 JetBlue             2.59      9.86 
## 15 United              1.64     10.5

Combine group_by with mutate

You can also combine group_by() with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() ranks values from the smallest to the largest.

# Ordered overview of average arrival delays per carrier
hflights %>%
    filter(!is.na(ArrDelay), ArrDelay > 0) %>%
    group_by(UniqueCarrier) %>%
    summarize(avg = mean(ArrDelay))%>%
    mutate(rank = rank(avg)) %>%
    arrange(rank)
## # A tibble: 15 x 3
##    UniqueCarrier        avg  rank
##    <chr>              <dbl> <dbl>
##  1 Mesa                18.7     1
##  2 Frontier            18.7     2
##  3 US_Airways          20.7     3
##  4 Continental         22.1     4
##  5 Alaska              22.9     5
##  6 SkyWest             24.1     6
##  7 ExpressJet          24.2     7
##  8 Southwest           25.3     8
##  9 AirTran             27.9     9
## 10 American            28.5    10
## 11 Delta               32.1    11
## 12 United              32.5    12
## 13 American_Eagle      38.8    13
## 14 Atlantic_Southeast  40.2    14
## 15 JetBlue             45.5    15

Advanced group_by exercises

By now you’ve learned the fundamentals of dplyr: the five data manipulation verbs and the additional group_by() function to discover interesting group-wise statistics. The next challenges are an all-encompassing review of the concepts you have learned about. We already provided you with a template of the piped call that can solve the exercises. Up to you to finish all dplyr calls! For simplicity, you can include cancelled flights in your answers, so you shouldn’t filter based on the Cancelled column.

# How many airplanes only flew to one destination?
hflights %>%
  group_by(TailNum) %>%
  summarize(ndest = n_distinct(Dest)) %>%
  filter(ndest == 1) %>%
  summarize(nplanes = n())
## # A tibble: 1 x 1
##   nplanes
##     <int>
## 1    1526
# Find the most visited destination for each carrier
hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarize(n = n()) %>%
  mutate(rank = rank(desc(n))) %>%
  filter(rank == 1)
## # A tibble: 15 x 4
## # Groups:   UniqueCarrier [15]
##    UniqueCarrier      Dest      n  rank
##    <chr>              <chr> <int> <dbl>
##  1 AirTran            ATL    2029     1
##  2 Alaska             SEA     365     1
##  3 American           DFW    2105     1
##  4 American_Eagle     DFW    2424     1
##  5 Atlantic_Southeast DTW     851     1
##  6 Continental        EWR    3924     1
##  7 Delta              ATL    2396     1
##  8 ExpressJet         CRP    3175     1
##  9 Frontier           DEN     837     1
## 10 JetBlue            JFK     695     1
## 11 Mesa               CLT      71     1
## 12 SkyWest            COS    1335     1
## 13 Southwest          DAL    8243     1
## 14 United             SFO     643     1
## 15 US_Airways         CLT    2212     1

Section 10 - dplyr and databases

dplyr deals with different types

hflights2 is a copy of hflights that is saved as a data table. hflights2 was made available in the background using the following code:

library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
hflights2 <- as.data.table(hflights)

hflights2 contains all of the same information as hflights, but the information is stored in a different data structure. You can see this structure by typing hflights2 at the command line.

Even though hflights2 is a different data structure, you can use the same dplyr functions to manipulate hflights2 as you used to manipulate hflights.

# Use summarize to calculate n_carrier
hflights2 %>%
    summarize(n_carrier = n_distinct(UniqueCarrier))
##   n_carrier
## 1        15

dplyr and mySQL databases

DataCamp hosts a mySQL database with data about flights that departed from New York City in 2013. The data is similar to the data in hflights, but it does not contain information about cancellations or diversions. With the tbl() function, we already created a reference to a table in this information.

Although nycflights is a reference to data that lives outside of R, you can use the dplyr commands on them as usual. Behind the scenes, dplyr will convert the commands to the database’s native language (in this case, SQL), and return the results. This allows you to pull data that is too large to fit in R: only the fraction of the data that you need will actually be downloaded into R, which will usually fit into R without memory issues.

library(RMySQL)
## Loading required package: DBI
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
# Set up a connection to the mysql database
my_db <- src_mysql(dbname = "dplyr", 
                   host = "courses.csrrinzqubik.us-east-1.rds.amazonaws.com", 
                   port = 3306, 
                   user = "student",
                   password = "datacamp")

# Reference a table within that source: nycflights
nycflights <- tbl(my_db, "dplyr")

# glimpse at nycflights
glimpse(nycflights)
## Observations: ??
## Variables: 17
## Database: mysql 5.6.34-log [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## $ id        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ year      <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ month     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ day       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ dep_time  <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 55...
## $ dep_delay <int> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2,...
## $ arr_time  <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 8...
## $ arr_delay <int> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7,...
## $ carrier   <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6"...
## $ tailnum   <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N...
## $ flight    <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301...
## $ origin    <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LG...
## $ dest      <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IA...
## $ air_time  <int> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149...
## $ distance  <int> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 73...
## $ hour      <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6...
## $ minute    <int> 17, 33, 42, 44, 54, 54, 55, 57, 57, 58, 58, 58, 58, ...
# Ordered, grouped summary of nycflights
nycflights %>%
    group_by(carrier) %>%
    summarize(n_flights = n(),
              avg_delay = mean(arr_delay, na.rm = TRUE)) %>%
    arrange(avg_delay)
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
## # Source:     lazy query [?? x 3]
## # Database:   mysql 5.6.34-log
## #   [student@courses.csrrinzqubik.us-east-1.rds.amazonaws.com:/dplyr]
## # Ordered by: avg_delay
##    carrier n_flights avg_delay
##    <chr>       <dbl>     <dbl>
##  1 AS            714    -9.86 
##  2 HA            342    -6.92 
##  3 AA          32729     0.356
##  4 DL          48110     1.63 
##  5 VX           5162     1.75 
##  6 US          20536     2.06 
##  7 UA          58665     3.50 
##  8 9E          18460     6.91 
##  9 B6          54635     9.36 
## 10 WN          12275     9.47 
## # ... with more rows

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] dbplyr_1.2.2      RMySQL_0.10.16    DBI_1.0.0        
##  [4] data.table_1.12.0 hflights_0.1      ggplot2_3.1.0    
##  [7] dplyr_0.8.0.1     gapminder_0.3.0   kableExtra_1.0.1 
## [10] knitr_1.21       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        plyr_1.8.4        pillar_1.3.1     
##  [4] compiler_3.5.2    prettydoc_0.2.1   tools_3.5.2      
##  [7] digest_0.6.18     gtable_0.2.0      evaluate_0.12    
## [10] tibble_2.0.1      viridisLite_0.3.0 pkgconfig_2.0.2  
## [13] rlang_0.3.1       cli_1.0.1         rstudioapi_0.9.0 
## [16] yaml_2.2.0        xfun_0.4          withr_2.1.2      
## [19] httr_1.4.0        stringr_1.4.0     xml2_1.2.0       
## [22] hms_0.4.2         webshot_0.5.1     grid_3.5.2       
## [25] tidyselect_0.2.5  glue_1.3.0        R6_2.4.0         
## [28] fansi_0.4.0       rmarkdown_1.11    readr_1.3.1      
## [31] purrr_0.3.0       magrittr_1.5      scales_1.0.0     
## [34] htmltools_0.3.6   assertthat_0.2.0  rvest_0.3.2      
## [37] colorspace_1.4-0  utf8_1.1.4        stringi_1.3.1    
## [40] lazyeval_0.2.1    munsell_0.5.0     crayon_1.3.4